OcrV1, Main, Exploration, bibRecord, 000545

Robust named entity detection from optical character recognition output

Internal identifier: 000545 (Main/Exploration); previous: 000544; next: 000546

Authors: Krishna Subramanian [United States]; Rohit Prasad [United States]; Prem Natarajan [United States]

Source:

International Journal on Document Analysis and Recognition (Print) [ISSN 1433-2833]; 2011.

RBID : Pascal:11-0343815

French descriptors

Pascal (Inist)
- Reconnaissance optique caractère, Extraction information, Reconnaissance caractère, Treillis, Linguistique, Langage naturel, Caractère manuscrit, Texte, Taux fausse alarme, Confiance, Multilinguisme, Arabe, Modèle Markov caché.
Wicri:
- topic : Linguistique, Multilinguisme.

English descriptors

KwdEn:
- Arabic, Character recognition, Confidence, False alarm rate, Hidden Markov model, Information extraction, Lattice, Linguistics, Manuscript character, Multilingualism, Natural language, Optical character recognition, Text.

Abstract

In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
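The contrast the abstract draws between 1-best lookup and lattice search can be illustrated with a minimal sketch. It is not the authors' algorithm: the lattice is simplified here to a confusion network (one slot of alternatives per word position), the posterior values are made up, and the names `one_best`, `find_in_one_best`, `find_in_lattice` and `min_confidence` are hypothetical. The confidence of a match is assumed to be the product of the per-slot posteriors.

```python
from math import prod

# Toy lattice, simplified to a confusion network: one slot per word position,
# each slot holding (hypothesis, posterior) alternatives. Values are invented.
lattice = [
    [("visit", 0.9), ("visil", 0.1)],
    [("Carnbridge", 0.6), ("Cambridge", 0.4)],  # the OCR top choice is wrong here
    [("today", 0.8), ("toclay", 0.2)],
]

def one_best(lattice):
    """Return the top-choice hypothesis of every slot."""
    return [max(slot, key=lambda alt: alt[1])[0] for slot in lattice]

def find_in_one_best(lattice, entity):
    """Exact lookup of a multi-word entity in the 1-best word sequence."""
    words = one_best(lattice)
    n = len(entity)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == entity]

def find_in_lattice(lattice, entity, min_confidence=0.0):
    """Search all alternatives; keep matches whose confidence passes the threshold."""
    hits = []
    n = len(entity)
    for start in range(len(lattice) - n + 1):
        scores = []
        for offset, word in enumerate(entity):
            alternatives = dict(lattice[start + offset])
            if word not in alternatives:
                break  # stop early: this start position cannot match
            scores.append(alternatives[word])
        else:
            confidence = prod(scores)  # assumed confidence measure
            if confidence >= min_confidence:
                hits.append((start, confidence))
    return hits

print(find_in_one_best(lattice, ["Cambridge"]))      # []
print(find_in_lattice(lattice, ["Cambridge"], 0.3))  # [(1, 0.4)]
```

Raising `min_confidence` to 0.5 rejects the match again, which shows the trade-off the abstract describes: a threshold that removes false alarms can also remove correct low-confidence detections.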

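Affiliations:

- Krishna Subramanian, Rohit Prasad, Prem Natarajan: Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA 02138, USA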

Links to previous steps (curation, corpus, ...)

to stream PascalFrancis, to step Corpus: 000122
to stream PascalFrancis, to step Curation: 000651
to stream PascalFrancis, to step Checkpoint: 000099
to stream Main, to step Merge: 000551
to stream Main, to step Curation: 000545

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Robust named entity detection from optical character recognition output</title>
<author><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0343815</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0343815 INIST</idno>
<idno type="RBID">Pascal:11-0343815</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000122</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000651</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000099</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Subramanian K:robust:named:entity</idno>
<idno type="wicri:Area/Main/Merge">000551</idno>
<idno type="wicri:Area/Main/Curation">000545</idno>
<idno type="wicri:Area/Main/Exploration">000545</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Robust named entity detection from optical character recognition output</title>
<author><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1>
<s2>Cambridge, MA 02138</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Massachusetts</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Character recognition</term>
<term>Confidence</term>
<term>False alarm rate</term>
<term>Hidden Markov model</term>
<term>Information extraction</term>
<term>Lattice</term>
<term>Linguistics</term>
<term>Manuscript character</term>
<term>Multilingualism</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance optique caractère</term>
<term>Extraction information</term>
<term>Reconnaissance caractère</term>
<term>Treillis</term>
<term>Linguistique</term>
<term>Langage naturel</term>
<term>Caractère manuscrit</term>
<term>Texte</term>
<term>Taux fausse alarme</term>
<term>Confiance</term>
<term>Multilinguisme</term>
<term>Arabe</term>
<term>Modèle Markov caché</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Linguistique</term>
<term>Multilinguisme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Massachusetts</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Massachusetts"><name sortKey="Subramanian, Krishna" sort="Subramanian, Krishna" uniqKey="Subramanian K" first="Krishna" last="Subramanian">Krishna Subramanian</name>
</region>
<name sortKey="Natarajan, Prem" sort="Natarajan, Prem" uniqKey="Natarajan P" first="Prem" last="Natarajan">Prem Natarajan</name>
<name sortKey="Prasad, Rohit" sort="Prasad, Rohit" uniqKey="Prasad R" first="Rohit" last="Prasad">Rohit Prasad</name>
</country>
</tree>
</affiliations>
</record>
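A record like the one above can be read with Python's standard library, as in the following sketch. The record uses the `wicri:` and `inist:` prefixes without declaring them, so a strict XML parser rejects it as written. As a workaround, the sketch wraps the record in an element that binds those prefixes to placeholder URIs. The URIs, the `<wrapper>` element, the `summarize` function and the abridged `RECORD` sample are all assumptions made for illustration and are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumed): the record does not declare them.
WICRI_NS = "urn:example:wicri"
INIST_NS = "urn:example:inist"

# Abridged copy of the record, keeping only a few elements.
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">Robust named entity detection from optical character recognition output</title>
<author><name sortKey="Subramanian, Krishna">Krishna Subramanian</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>Raytheon BBN Technologies, 10 Moulton Street</s1></inist:fA14>
<country>États-Unis</country></affiliation></author>
<author><name sortKey="Prasad, Rohit">Rohit Prasad</name></author>
<author><name sortKey="Natarajan, Prem">Prem Natarajan</name></author>
</titleStmt></fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Arabic</term>
<term>Lattice</term></keywords></textClass></profileDesc>
</teiHeader></TEI></record>"""

def summarize(record_xml):
    """Extract title, authors and English keywords from one record."""
    # Bind the undeclared prefixes on a wrapper element so the parser accepts them.
    wrapped = (
        f'<wrapper xmlns:wicri="{WICRI_NS}" xmlns:inist="{INIST_NS}">'
        f"{record_xml}</wrapper>"
    )
    root = ET.fromstring(wrapped)
    return {
        "title": root.find(".//titleStmt/title").text,
        "authors": [n.text for n in root.findall(".//titleStmt/author/name")],
        "keywords": [t.text for t in root.findall(".//keywords[@scheme='KwdEn']/term")],
    }

summary = summarize(RECORD)
print(summary["authors"])
print(summary["keywords"])
```

The same function should work on the full record text, since it relies only on element names and the `scheme` attribute that appear in it.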

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000545 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000545 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:11-0343815
   |texte=   Robust named entity detection from optical character recognition output
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development! It was generated by automated means from raw corpora, so the information has not been validated.